# Data Dumping System

## Overview
A Python-based solution for exporting large datasets from MySQL databases with smart batching and Docker containerization.
## Features

### Current Implementation
**Supported Dump Types**

- `zm`: ZM Data Dump (requires date range)
- `mcct`: MCCT Data Dump (requires date range)
- `bcu`: Daily BCU Data Dumps (single snapshot)
- `vaud`: Vaccinator Audit For Health Ministry (single snapshot)
**Core Functionality**
- 🗓️ Weekly batch processing for date-range dumps
- 🚰 10k-row streaming using server-side cursors
- 📦 GZIP compression for output files
- 📊 Automatic creation of report-download entries in the database
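The streaming and compression features above combine into a single loop: fetch 10k rows at a time from a server-side cursor and append them to a gzipped CSV. The sketch below is illustrative, not the actual `dumper.py` API (`dump_rows` and its signature are assumptions); with PyMySQL the cursor would be a `pymysql.cursors.SSCursor` so results stream from the server instead of loading into memory.

```python
import csv
import gzip

BATCH_SIZE = 10_000  # rows fetched per round trip

def dump_rows(cursor, out_path):
    """Stream rows from a server-side cursor into a gzipped CSV.

    `cursor` is any DB-API cursor with an executed query; with a
    PyMySQL SSCursor, fetchmany() pulls rows from the server in
    chunks rather than buffering the full result set client-side.
    """
    total = 0
    with gzip.open(out_path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            rows = cursor.fetchmany(BATCH_SIZE)
            if not rows:
                break
            writer.writerows(rows)
            total += len(rows)
    return total
```

Because rows are written as they arrive, peak memory stays proportional to `BATCH_SIZE`, not to the dataset size.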
### Execution Commands
```bash
# Date-range dumps (zm/mcct)
docker compose run --rm dumper --dump {TYPE} {START_DATE} {END_DATE}

# Single-snapshot dumps (bcu/vaud)
docker compose run --rm dumper --dump {TYPE}
```
### Scheduling
**Cron Jobs**: Currently, we use cron jobs to schedule the BCU dump daily at 5:01 PM. The Vaccinator Audit dump is scheduled at 10:30 AM, 1:30 PM, and 5:00 PM. Both dumps compute their date and time ranges with native SQL DATETIME functions, which could be improved.
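The schedule above corresponds to crontab entries along these lines (the working directory and exact invocation are illustrative; the real crontab may differ):

```crontab
# BCU dump, daily at 5:01 PM
1 17 * * * cd /opt/dumper && docker compose run --rm dumper --dump bcu

# Vaccinator Audit dump at 10:30 AM, 1:30 PM, and 5:00 PM
30 10 * * * cd /opt/dumper && docker compose run --rm dumper --dump vaud
30 13 * * * cd /opt/dumper && docker compose run --rm dumper --dump vaud
0 17 * * * cd /opt/dumper && docker compose run --rm dumper --dump vaud
```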
## Docker Integration
```yaml
# docker-compose.yml excerpt
services:
  dumper:
    env_file: .env
    volumes:
      - ./data:/app/data
      - ./server_dumps:/mnt/server_dumps
    environment:
      - QUERY_FILE_ZM=./queries/zm.sql
      - QUERY_FILE_MCCT=./queries/mcct.sql
```
## Limitations

### Technical Constraints
- **Parallel Execution**

  ❗ Cannot run simultaneous dumps due to:
  - Hardcoded container name
  - Ephemeral container strategy (container is destroyed after each dump task)
- **Configuration Rigidity**

  Adding a new dump type requires changes in three places:
  - `code/dumper.py` (code modifications)
  - `docker-compose.yml` (environment variables)
  - `queries/new_dump.sql` (SQL file)
- **Date Handling**
  - Fixed 7-day batches
  - No time granularity support
## Roadmap

### Phase 1: Dynamic Configuration
```yaml
# Proposed config/dumps.yaml
dumps:
  zm:
    query_file: queries/zm.sql
    date_range: required
    batch_days: 7
  bcu:
    query_file: queries/bcu.sql
    date_range: prohibited
```
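With such a file, the dumper could validate CLI arguments against each dump's declared `date_range` policy instead of hardcoding the rules. The sketch below operates on the already-parsed config (e.g. loaded with PyYAML); `validate_args` and the policy values are proposals matching the YAML above, not existing code.

```python
def validate_args(config, dump_type, start=None, end=None):
    """Check CLI date arguments against a dump's declared policy.

    `config` is the parsed `dumps:` mapping; each entry's
    `date_range` is "required" (zm/mcct-style) or "prohibited"
    (bcu/vaud-style snapshots).
    """
    spec = config.get(dump_type)
    if spec is None:
        raise ValueError(f"unknown dump type: {dump_type}")
    policy = spec.get("date_range", "prohibited")
    has_dates = start is not None and end is not None
    if policy == "required" and not has_dates:
        raise ValueError(f"{dump_type} requires a start and end date")
    if policy == "prohibited" and (start or end):
        raise ValueError(f"{dump_type} takes no date range")
    return spec
```

New dump types would then need only a YAML entry and an SQL file, removing the `dumper.py` and `docker-compose.yml` edits listed under Limitations.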
### Phase 2: Parallel Execution Architecture
```mermaid
graph TD
    A[API Request] --> B(RabbitMQ)
    B --> C[Worker 1] --> E[(MySQL)]
    B --> D[Worker 2] --> E[(MySQL)]
```
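The fan-out pattern in the diagram can be prototyped with a stdlib queue before committing to RabbitMQ; in the proposed architecture each worker would be a separate container consuming from the broker, but the scheduling logic is the same. Everything below (`run_workers`, the handler) is illustrative:

```python
import queue
import threading

def run_workers(jobs, handler, n_workers=2):
    """Fan dump jobs out to n_workers threads, mirroring the diagram."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            out = handler(job)  # e.g. run one dump against MySQL
            with lock:
                results.append(out)

    for job in jobs:
        q.put(job)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Replacing the in-process queue with RabbitMQ would also remove the hardcoded-container-name constraint, since each worker gets its own container identity.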
### Phase 3: Enhanced Filtering
```bash
# Proposed time-range support
docker compose run --rm dumper \
  --dump zm \
  --start "2024-01-01 08:00" \
  --end "2024-01-07 18:00" \
  --batch-hours 4 \
  --batch-rows 10000
```
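The proposed `--batch-hours` flag would replace the fixed 7-day batching with hour-granular windows. A minimal sketch of the window-splitting logic (the function name is illustrative):

```python
from datetime import datetime, timedelta

def hour_batches(start, end, batch_hours):
    """Split [start, end) into consecutive windows of batch_hours.

    The final window is clipped to `end`, so a 10-hour range with
    4-hour batches yields windows of 4, 4, and 2 hours.
    """
    windows = []
    cursor = start
    step = timedelta(hours=batch_hours)
    while cursor < end:
        windows.append((cursor, min(cursor + step, end)))
        cursor += step
    return windows
```

Each window would then be bound to the query's DATETIME parameters, replacing the native SQL date arithmetic the cron-driven dumps use today.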
## Technical Specifications
| Component | Technology | Purpose |
|---|---|---|
| Database Driver | PyMySQL | MySQL connection management |
| Batch Processing | SSCursor | Server-side result streaming |
| Compression | gzip | Output file compression |
| Orchestration | Docker Compose | Container lifecycle management |
## Maintenance Notes
- **Adding New Dump Types**

  The current process requires modifying multiple files; track progress on the dynamic configuration proposal (Phase 1).
- **Memory Management**

  The 10k batch size was tested for:
  - 512MB RAM containers
  - Datasets up to 200M rows